General Overview

What is metadata?

  • Metadata is a set of data that describes and provides information about other data. It is commonly defined as data about data.
  • Sample metadata, as used in this book, refers to the description and context of the individual samples collected for a specific microbiome study.


Metadata structure

  • Metadata collected at different stages are typically organized in an Excel or Google spreadsheet where:
    • The table columns represent the properties of the samples.
    • The table rows contain the information associated with each sample.
    • Typically, the first column of sample metadata is the Sample ID, which serves as the key for an individual sample.
    • Each Sample ID must be unique.
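
The layout described above can be checked programmatically. The following is a minimal Python sketch, assuming the metadata has been exported as a comma-separated file with the sample ID in the first column; the helper name `find_duplicate_ids` and the example table are hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(handle, delimiter=","):
    """Return sample IDs that occur more than once in the first column."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row (column names)
    ids = [row[0].strip() for row in reader if row]
    counts = Counter(ids)
    return sorted(sample_id for sample_id, n in counts.items() if n > 1)


# Hypothetical metadata table: rows are samples, columns are properties.
example = io.StringIO(
    "sample_id,host,collection_date\n"
    "S001,human,2020-01-15\n"
    "S002,human,2020-01-16\n"
    "S001,human,2020-01-17\n"
)
duplicates = find_duplicate_ids(example)
print(duplicates)  # ['S001']
```

An empty list means every sample ID is unique, so the column can safely be used as a key.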


Embedded metadata

  • In most cases, you will find the metadata detached from the experimental data.
  • Embedded metadata is integrated with the experimental data, which is especially useful for graphics.
  • Major microbiome analysis platforms require sample metadata, commonly referred to as a mapping file, for downstream analysis.



Explore SRA metadata

Downloading SRA metadata

  • Via SRA Run Selector
  • Via Entrez Direct
  • Using pysradb

For demonstration, we will explore sample metadata retrieved from four randomly selected microbiome BioProjects:

  1. PRJNA477349: 16S: rRNA from bushmeat samples collected from Tanzania Metagenome
  2. PRJNA802976: 16S: Changes to Gut Microbiota following Systemic Antibiotic Administration in Infants
  3. PRJNA685168: WGS: Multi-omics suggest diverse mechanisms for response to biologic therapies in IBD
  4. PRJEB21612: WGS: Alterations of the gut microbiome in hypertension

Using SRA Run Selector

Metadata associated with a specific project can be retrieved manually via the SRA Run Selector or using the Entrez Direct (edirect) scripts.

  • Note that the SRA Run Selector automatically names the metadata file SraRunTable.txt; for clarity, we rename it to the NCBI BioProject ID with a .csv extension.
  • We will save the metadata file in the data/metadata/ folder.
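
The renaming step can be scripted. This is a minimal Python sketch, assuming the Run Selector download is already comma-separated; the helper name `save_run_table`, the run accession in the stand-in file, and the temporary directory are all illustrative. The destination folder is passed as an argument, so it can point to wherever the project stores its metadata.

```python
import shutil
import tempfile
from pathlib import Path


def save_run_table(downloaded, bioproject_id, out_dir):
    """Copy a Run Selector download to <out_dir>/<bioproject_id>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = out_dir / f"{bioproject_id}.csv"
    shutil.copyfile(downloaded, target)  # content is unchanged, only the name
    return target


# Demonstration in a temporary directory with a stand-in download
# (the run accession below is made up for illustration).
workdir = Path(tempfile.mkdtemp())
download = workdir / "SraRunTable.txt"
download.write_text("Run,BioProject\nSRR0000001,PRJNA477349\n")
saved = save_run_table(download, "PRJNA477349", workdir / "metadata")
print(saved.name)  # PRJNA477349.csv
```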


Example screenshot of the SRA Run Selector showing metadata associated with NCBI SRA BioProject PRJNA477349


Explore the top and bottom columns in each project

                     .
1        run_accession
2      study_accession
3          study_title
4 experiment_accession
5     experiment_title
                  .
53 ena_fastq_http_1
54 ena_fastq_http_2
55    ena_fastq_ftp
56  ena_fastq_ftp_1
57  ena_fastq_ftp_2
[1] "There are 133 rows and 57 columns in PRJNA477349 metadata"
                     .
1        run_accession
2      study_accession
3          study_title
4 experiment_accession
5     experiment_title
                  .
52 ena_fastq_http_1
53 ena_fastq_http_2
54    ena_fastq_ftp
55  ena_fastq_ftp_1
56  ena_fastq_ftp_2
[1] "There are 54 rows and 56 columns in PRJNA802976 metadata"
                     .
1        run_accession
2      study_accession
3          study_title
4 experiment_accession
5     experiment_title
                  .
74 ena_fastq_http_1
75 ena_fastq_http_2
76    ena_fastq_ftp
77  ena_fastq_ftp_1
78  ena_fastq_ftp_2
[1] "There are 114 rows and 78 columns in PRJNA685168 metadata"
                     .
1        run_accession
2      study_accession
3          study_title
4 experiment_accession
5     experiment_title
                  .
65 ena_fastq_http_1
66 ena_fastq_http_2
67    ena_fastq_ftp
68  ena_fastq_ftp_1
69  ena_fastq_ftp_2
[1] "There are 117 rows and 69 columns in PRJEB21612 metadata"
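
A summary like the one above can be produced in a few lines. The output shown comes from R; the following is only a Python sketch of the same idea, assuming each project's metadata is a comma-separated file with a header row. The function name `summarize_columns` and the small example table are hypothetical.

```python
import csv
import io


def summarize_columns(handle, project, n=5):
    """Return the first/last n column names and a one-line size summary."""
    reader = csv.reader(handle)
    header = next(reader)
    n_rows = sum(1 for row in reader if row)  # count non-empty data rows
    summary = (
        f"There are {n_rows} rows and {len(header)} columns "
        f"in {project} metadata"
    )
    return header[:n], header[-n:], summary


# Hypothetical five-column table standing in for a real metadata file.
example = io.StringIO(
    "run_accession,study_accession,study_title,ena_fastq_ftp_1,ena_fastq_ftp_2\n"
    "RUN1,STUDY1,Demo,NA,NA\n"
    "RUN2,STUDY1,Demo,NA,NA\n"
)
top, bottom, summary = summarize_columns(example, "DEMO", n=2)
print(top)
print(bottom)
print(summary)  # There are 2 rows and 5 columns in DEMO metadata
```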



Downloading via Entrez SRA runinfo

  • Entrez Direct runinfo returns a uniform set of 47 columns for each BioProject.
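
One way to script the runinfo download is to call the Entrez Direct command-line tools from Python. This is a sketch, assuming `esearch` and `efetch` are installed and on the PATH, and that network access is available; the wrapper names `runinfo_command` and `fetch_runinfo` are hypothetical.

```python
import shlex
import shutil
import subprocess


def runinfo_command(bioproject_id):
    """Build the Entrez Direct pipeline that prints runinfo as CSV."""
    query = shlex.quote(bioproject_id)  # guard against shell metacharacters
    return f"esearch -db sra -query {query} | efetch -format runinfo"


def fetch_runinfo(bioproject_id):
    """Run the pipeline; requires Entrez Direct on PATH and network access."""
    if shutil.which("esearch") is None or shutil.which("efetch") is None:
        raise RuntimeError("Entrez Direct (esearch/efetch) not found on PATH")
    result = subprocess.run(
        runinfo_command(bioproject_id),
        shell=True, check=True, capture_output=True, text=True,
    )
    return result.stdout  # CSV text, one row per run


# Only the command is built here; nothing is sent to NCBI.
command = runinfo_command("PRJNA477349")
print(command)  # esearch -db sra -query PRJNA477349 | efetch -format runinfo
```

Calling `fetch_runinfo("PRJNA477349")` would return the CSV text, which can then be written to a file named after the BioProject.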


Note: Full metadata, which is BioProject-specific, can be downloaded manually from the SRA database using the SRA Run Selector option as described above.



Graphical exploration

Demo with PRJNA477349 metadata

The PRJNA477349 metadata contains latitude and longitude information, which makes it possible to drop pins on the collection sites.
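
Before pins can be placed on a map, the coordinates usually need to be converted to signed decimal degrees. The sketch below assumes the coordinates are stored in a single text field that follows the common "degrees, hemisphere letter" convention (for example `6.17 S 35.74 E`); the actual column name and format in PRJNA477349 are not shown here. The coordinates in the example are made up and the helper name `parse_lat_lon` is hypothetical.

```python
import re

# Pattern for "<lat> <N|S> <lon> <E|W>" in decimal degrees (assumed format).
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)


def parse_lat_lon(text):
    """Convert '6.17 S 35.74 E' to signed degrees, or None if unparseable."""
    match = _LAT_LON.match(text or "")
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (-1 if ns.upper() == "S" else 1)  # south is negative
    longitude = float(lon) * (-1 if ew.upper() == "W" else 1)  # west is negative
    return latitude, longitude


print(parse_lat_lon("6.17 S 35.74 E"))  # (-6.17, 35.74)
print(parse_lat_lon("not collected"))   # None
```

Returning `None` for unparseable values makes it easy to skip samples without usable coordinates when plotting.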


Frequency of variables



Sampling points



Demo with PRJNA685168 metadata

PRJNA685168 is an IBD study of responses to biologic therapies; its metadata contains sex and age features.


Frequency of variables
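
The counts behind a frequency plot can be computed directly from the metadata table. This is a minimal Python sketch, assuming a comma-separated file with a header row; the column name `sex`, the example rows, and the helper name `value_frequencies` are illustrative and not taken from the real PRJNA685168 file.

```python
import csv
import io
from collections import Counter


def value_frequencies(handle, column):
    """Count how often each value occurs in one metadata column."""
    reader = csv.DictReader(handle)
    # Empty or absent cells are grouped under the label "missing".
    return Counter(
        (row.get(column) or "").strip() or "missing" for row in reader
    )


# Hypothetical table with one categorical variable.
example = io.StringIO(
    "run_accession,sex\n"
    "RUN1,female\n"
    "RUN2,male\n"
    "RUN3,female\n"
    "RUN4,\n"
)
freq = value_frequencies(example, "sex")
for value, count in freq.most_common():
    print(value, count)
```

The resulting counts can be passed to any plotting library to draw a bar chart for each variable of interest.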



Demo with PRJNA802976 metadata

PRJNA802976 is a gut microbiota study of changes following systemic antibiotic administration in infants.


Sampling points



Demo with PRJEB21612 metadata

PRJEB21612 is a hypertension study of alterations of the gut microbiome.


Sampling points







Appendix

Project main tree

Screenshot of interactive snakemake report

The interactive Snakemake HTML report can be viewed by opening report.html in any compatible browser. You can explore the workflow and its associated statistics, and you can close the left sidebar to get a wider view of the display.

Troubleshooting and FAQs

  1. Question
    • Answer
  2. Question
    • Answer